ABSTRACT

This study was designed to increase knowledge of the
functioning of item bias techniques in detecting biased items.
Previous studies have used computer-generated data or real data with
unknown amounts of bias. The present project extends previous studies
by using items that are logically generated and subjectively
evaluated a priori to be biased or unbiased, and simultaneously
controls the amount of bias and true ability differences (as measured
by the unbiased items). The study evaluated the functioning of four
statistical methods of assessing test item bias (transformed item
difficulties, chi-square, three parameter and one parameter item
characteristic curves) when (1) tests have varying amounts of bias (0
biased/60 items, 18 biased/78 items, 40 biased/100 items) and (2)
ability differences on the unbiased items were either one-half or one
standard deviation apart. Results indicate that agreement among the
methods and between the statistical methods and judged bias was
generally high except for data set VI (40 percent biased items, one
standard deviation difference). Problems with individual methods and
with cutoffs are discussed. Finally, presence of biased items did not
affect reliability but did decrease validity and increase score
differences between groups. (Author/CM)

ITEM BIAS TECHNIQUES WHEN AMOUNT OF BIAS IS VARIED
AND SCORE DIFFERENCES BETWEEN GROUPS ARE PRESENT

The issue of bias in measurement and selection is an important one
in allowing equal opportunity for persons of equal ability regardless
of the disadvantaged group to which they may belong. Since tests
are increasingly used as devices for evaluation and placement, test
constructors must make every effort to remove bias from them. Minority
groups claim that traditional education and employment tests may not
be measuring their true ability since the tests are based on the
cultural experiences of the white middle class (Williams, 1970, 1971).

In recognition of this problem, two conferences (National Institute
of Education, 1975; U.S. Office of Education, 1976) were held addressing
questions of bias. In addition, sessions at several national
organizations (AERA, APA, NCME) were devoted to examining items for bias in
1978 and 1979. The literature has proliferated in the last few years
(for example, JEM, 1976) and has followed two main streams of inquiry:
(1) bias in selection, covering predictions made by a test in the presence
of an external criterion; and (2) item bias, studied in the absence of an
external criterion (which would be most useful during test development).

The present study attempts to address some questions not yet
explored by recent research in the area of item bias. Previous studies
have used computer-generated data or real data with unknown amounts of
bias. This study proposes to find out which method is best under a
variety of conditions aimed at simulating various features of realistic
conditions. This information is essential because the methods differ
widely in terms of cost, sample size required, and ease of
implementation.

PURPOSE

The literature on item bias contains several excellent reviews
(Merz, 1978; Peterson, 1977; Rudner, Getson, & Knight, 1980). Various
methods that have been explored include: (1) analysis of variance
(Cardall & Coffman, 1964; Cleary & Hilton, 1968); (2) transformed item
difficulties (Angoff & Ford, 1973); (3) discrimination measures (Green &
Draper, 1972; Ozenne, Van Gelder, & Cohen, 1974); (4) item
characteristic curves (Ironson, 1982; Lord, 1977; Wright, Mead, & Draba, 1976);
(5) chi-square (Scheuneman, 1979); (6) multivariate factor structures
(Green, 1976; Green & Draper, 1972; Merz, 1973, 1976a); and (7) the
response foil approach (Veale & Foreman, 1976).

Recent research has attempted to examine how effectively these 
methods identify biased items and the concordance among the methods. 
Ironson and Subkoviak (1979) found support for the three parameter 
item characteristic curve approach, the chi-square procedure, and the 
transformed item difficulty procedure. Rudner, Getson, and Knight
(1979) found most support for the three parameter procedure, a
chi-square procedure using five intervals, and a transformed difficulty
approach. Merz and Grossen (1979) favored the transformed item
difficulties procedure.



These recent studies on item bias, as well as previous ones, have
used either existing data sets where the amount of bias is unknown a
priori (Ironson & Subkoviak, 1979; Nungester, 1977; Rudner & Convey,
1978; Scheuneman, 1975, 1977) or Monte Carlo procedures where bias
was statistically generated by a computer (Merz & Grossen, 1968;
Rudner, Getson, & Knight, 1979). Although these computer studies have been
able to control the amount of bias in test analysis, these studies
have defined bias according to an arbitrary choice of model (the item
characteristic curve is the one frequently used), and the data are of
necessity artificial.

A further problem with detecting item bias is that measuring the
bias in items against an internal criterion of the test as a whole is
only logically valid to the extent that the test as a whole is
considered to be less biased than the individual items. When attempting to
control for ability differences, the methods do so with biased items
used to measure the ability which is used to measure the bias in items.
Thus, this whole circular process confounds ability and bias and is
likely to be affected by the proportion of biased items. This problem
of circularity is particularly important in minority testing, where
observed differences in test scores of one standard deviation have
been found (Linn, 1973) and ability differences and differences due
to bias are confounded to an unknown degree.

The present study extends previous research by having the realism
of an actual data set but in more controlled analysis situations.
Males and females were chosen for study for several reasons, the most
important of which are that:

1. The study was not designed to examine CONTENT bias against
any particular group; it was designed to test which METHODS are
functioning properly; and

2. The study COULD NOT answer the questions about the sufficiency
of the methods if blacks and whites were used. This important design
issue is discussed further later.

The conditions of the study, however, were chosen to have direct
relevance to minority testing. The present study addresses several of
the issues raised herein by:

1. Having items that are logically generated and evaluated a 
priori to be "biased" or "unbiased"; 

2. Analyzing tests composed of specified proportions of bias 
rather than having the amount of bias unknown; and 

3. Selecting samples so that observed test score differences are
one standard deviation apart (to simulate black/white differences),
but where these observed differences are the result of known
combinations of ability differences and differences due to bias.

The study compares the efficacy of four methods in identifying
biased items in each of the conditions of 2 and 3 above. The four
methods chosen for study were the transformed item difficulty approach,
the chi-square procedure, the three parameter item characteristic
curve procedure, and the one parameter item characteristic curve
procedure. These were chosen for study since they showed promise in
previous studies.

In addition to the theoretical question, an important practical
question was being addressed, as the methods differ widely in cost,
sample size required, and sophistication required to understand and
implement the method. The three parameter method is very costly,
requires very large sample sizes, and requires sophisticated background
knowledge. At the other end of the continuum is the transformed item
difficulty procedure. This procedure can easily be implemented, requires
a much smaller sample size, and requires less mathematical
sophistication to understand.

Thus, the study is designed to determine which method of detecting
biased items is best and under what conditions.

REVIEW OF ITEM BIAS METHODS

Example of a biased item. Suppose, as part of a general
information test, a Canadian is asked to answer the question: "How many
senators are there in the U.S. Congress?" This question would likely
be regarded as biased against Canadians because, according to one
popular definition of bias, essentially equal ability Canadians and
Americans would have an unequal chance of getting this item correct.
If we embedded this item in a larger test of general information, it
might be identified statistically by the various procedures described
below.

Transformed item difficulty. In this approach, an item is
considered biased if for a given group it is relatively more difficult
than other items on a test. The first step in this procedure involves
calculating the item difficulty or p-value (proportion of subjects
getting the item correct) for each of the two groups on each of the items.
The p-values are then transformed into normal deviate Z-values; i.e.,
Z is the tabled value having proportion (1-p) of the normal distribution
below it. Then a delta value (4Z + 13) is calculated from the
tabled Z to eliminate negative Z values, so that a large delta value
indicates a difficult item. The pairs of transformed delta values (one
pair for each item) are plotted on a bivariate graph, each pair being
represented by a point on the graph.

The plot of these points appears as an ellipse extending from
lower left to upper right. In order to identify the biased items,
it is necessary to determine the major axis of the ellipse and the
distance of the items from that axis. Those items that are relatively
more difficult for one group than another fall at some distance from
the axis and are identified as biased. The equation to be used for the
major axis of the ellipse is given by Angoff and Ford (1973, p. 98).
If the major axis is denoted by Y = AX + B, then the formula for the
perpendicular distance D_i of each point i in the plot to the line
is given as:

$$D_i = \frac{A X_i - Y_i + B}{\sqrt{A^2 + 1}}$$

This perpendicular distance is a measure of the relative difficulty of
an item and thus is a measure of the item's bias.
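
The steps of the transformed item difficulty approach can be sketched in Python. This is a minimal illustration under stated assumptions, not the program used in the study: the function names `delta` and `tid_distances` are hypothetical, the slope of the major axis is computed with the standard principal-axis formula (assumed here to correspond to the equation cited from Angoff and Ford), every p-value must lie strictly between 0 and 1, and the two sets of deltas must have a nonzero covariance.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Delta for a proportion correct p: 4Z + 13, where Z is the normal
    deviate with proportion (1 - p) of the distribution below it."""
    z = NormalDist().inv_cdf(1.0 - p)
    return 4.0 * z + 13.0


def tid_distances(p_group_x, p_group_y):
    """Signed perpendicular distance of each item from the major axis
    Y = AX + B of the delta plot (hypothetical helper)."""
    x = [delta(p) for p in p_group_x]
    y = [delta(p) for p in p_group_y]
    mx, my = mean(x), mean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Slope of the major (principal) axis; assumes sxy != 0.
    A = ((syy - sxx) + sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    B = my - A * mx
    return [(A * xi - yi + B) / sqrt(A * A + 1.0) for xi, yi in zip(x, y)]


# Illustrative p-values only; the last item is much harder for group Y.
p_x = [0.9, 0.7, 0.5, 0.3, 0.6]
p_y = [0.85, 0.65, 0.5, 0.2, 0.3]
distances = tid_distances(p_x, p_y)
```

With this sign convention a negative distance marks an item that is relatively more difficult for the group plotted on the Y axis; which group is placed on which axis is a choice left to the analyst. Because the major axis passes through the centroid of the plot, the signed distances sum to zero.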

Chi-square. Two chi-square type procedures were calculated. The
first considers only proportion correct and is therefore not a true
chi-square statistic. According to Scheuneman (1975), "An item is
unbiased if, for all individuals having the same score on a homogeneous
subtest containing the item, the proportion of individuals getting the
item correct is the same for each population group being considered"
(p. 2).

The first part of the chi-square analysis involves establishing
ability intervals for each item. This is accomplished with data from
two sets of distributions: first, standard frequency distributions of
total test scores, plotted separately for each group; and second,
bivariate distributions of the number of correct responses to each
test item by group. Scheuneman (1979) notes that at least three ability
levels (calculated from total test scores) must be used for each item
and there is little to be gained by having more than five intervals.
Therefore, the study attempts to use five ability levels for each item
unless the following two criteria cannot be met (in which case fewer
than five will be used). First, each score interval must contain five
or more correct responses from each group. Second, the probability
of a correct response within a given interval must be between 0 and 1.
The second part of the analysis involves the calculation of the
chi-square value. The degrees of freedom for the test are reduced to
(a-2)(b-1), where a is the number of ability levels and b is the number
of groups. The degrees of freedom are (a-2) for the ability dimension
because both the ability level of the examinee and the probability of
a correct response must be estimated from the sample data. In addition,
the formula for the expected cell frequencies in this procedure is
different from the standard chi-square procedure. Algebraically, it is:

$$E_{xy} = \frac{A_{\cdot y}}{B_{\cdot y}} \, C_{xy}$$

where A.y is the number of examinees in ability level y responding
correctly; B.y is the total number of examinees in ability level y; and
C_xy is the total number of examinees in group x and ability level y.
The value of chi-square is calculated, a large chi-square indicating
much bias.
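
The expected-frequency formula and the resulting statistic can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the program used in the study: the function name `scheuneman_chi_square` and the nested-list layout (`correct[x][y]` and `totals[x][y]` for group x and ability level y) are hypothetical, and the statistic is assumed to be the usual sum of (observed - expected)^2 / expected over the cells of correct responses only.

```python
def scheuneman_chi_square(correct, totals):
    """Chi-square type index based on correct responses only.

    correct[x][y]: number in group x, ability level y answering correctly.
    totals[x][y]:  number of examinees in group x, ability level y (C_xy).
    Returns (chi_square, degrees_of_freedom).
    """
    n_groups = len(correct)
    n_levels = len(correct[0])
    chi_square = 0.0
    for y in range(n_levels):
        a_y = sum(correct[x][y] for x in range(n_groups))  # A.y
        b_y = sum(totals[x][y] for x in range(n_groups))   # B.y
        for x in range(n_groups):
            expected = a_y / b_y * totals[x][y]             # E_xy
            chi_square += (correct[x][y] - expected) ** 2 / expected
    # Degrees of freedom as stated in the text: (a - 2)(b - 1).
    df = (n_levels - 2) * (n_groups - 1)
    return chi_square, df


# Illustrative counts: the groups differ only in the lowest ability level.
correct = [[12, 20, 30], [8, 20, 30]]
totals = [[20, 40, 40], [20, 40, 40]]
chi_square, df = scheuneman_chi_square(correct, totals)
```

In this made-up example the expected count in the lowest level is 10 for each group, so the two cells contribute (12 - 10)^2/10 + (8 - 10)^2/10 = 0.8 and the other levels contribute nothing.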

The second chi-square type procedure follows the same logic but
includes both correct and incorrect responses. It is described fully
in Ironson (1982) and will be referred to in this report as the full
chi-square.

Three parameter item characteristic curve: An item characteristic
curve (ICC) specifies the relationship between the probability of an
examinee answering an item correctly and the examinee's ability level
(Birnbaum, 1968; Lord & Novick, 1968). The equation for the three
parameter logistic model is given by Birnbaum (1968):

$$P(U_g = 1 \mid \theta) = c_g + \frac{1 - c_g}{1 + e^{-1.7 a_g (\theta - b_g)}}$$

where P(U_g = 1 | θ) is the probability of a correct response to item g
given an examinee of ability level θ; a_g is an item discrimination
index; b_g is an item difficulty index; and c_g is a pseudo-guessing
parameter. The curve for each item is determined from three parameters
(a_g, b_g, and c_g) that are estimated by the LOGIST procedure (Wood &
Lord, 1976; Wood, Wingersky, & Lord, 1976). This is done separately for
each of the two groups. However, in the present study, the test is
composed of free response items so that only the difficulty and
discrimination parameters need to be estimated. Thus, in the formula
above, the c_g parameter becomes zero. An unbiased item is one whose
parameter values and item characteristic curves are the same for
different ethnic groups, after equating (putting the parameter values on
the same scale).
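
As an illustration, the three parameter logistic curve can be evaluated with a few lines of Python. This is a sketch only: the function name `icc_3pl` is hypothetical, and the scaling constant D = 1.7 is the customary value for the logistic model and is assumed here rather than taken from the study. Setting c to zero gives the two parameter form that applies to free response items.

```python
from math import exp


def icc_3pl(theta, a, b, c=0.0, D=1.7):
    """Probability of a correct response under the three parameter
    logistic model (D = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


# With c = 0, an examinee whose ability equals the item difficulty
# has a 50 percent chance of answering correctly.
p_at_difficulty = icc_3pl(theta=0.0, a=1.0, b=0.0)
# With a pseudo-guessing value of .2 the same examinee has c + (1 - c)/2.
p_with_guessing = icc_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```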

One parameter item characteristic curve: In this model, also referred
to as the Rasch model, the probability of a correct response is a
function of an examinee's ability and only one item parameter,
difficulty. Formulas are given in Wright (1977). Items can first be
tested for fit to the model (Wright & Mead, 1977). Bias is measured by a
shift in difficulty value for an item in the two groups (Draba, 1978;
Durovic, 1975; Wright, Mead, & Draba, 1976). A t statistic is used for
this purpose:

$$t_i = \frac{d_{1i} - d_{2i}}{\sqrt{se_{1i}^2 + se_{2i}^2}}$$

where d_1i is the difficulty estimate for item i in group 1 and se_1i is
the standard error of that estimate for group 1 (d_2i and se_2i are the
corresponding values for group 2). If t is large, an item is relatively
more difficult for one group and is thus biased.
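
A small Python sketch of this comparison follows. It assumes the common form of the statistic, the difference between the two difficulty estimates divided by the square root of the sum of their squared standard errors; the function name `rasch_bias_t` and the numbers are hypothetical.

```python
from math import sqrt


def rasch_bias_t(d1, se1, d2, se2):
    """t statistic for the shift in Rasch difficulty between two groups.

    d1, d2:   difficulty estimates for the item in groups 1 and 2.
    se1, se2: standard errors of those estimates.
    """
    return (d1 - d2) / sqrt(se1 ** 2 + se2 ** 2)


# Illustrative values: the item is one logit harder in group 1.
t_value = rasch_bias_t(d1=1.0, se1=0.3, d2=0.0, se2=0.4)
```

A positive value means the item is relatively more difficult for group 1, so the sign depends on which group is entered first.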



PROCEDURES

Sample. The sample used for the present study consisted of 533
male and 590 female undergraduates. The procedure requiring the
largest number of examinees was the three parameter model using LOGIST.
The sample size of over 500 is sufficiently above the minimum required
since the test is long (60 to 100 items) and use of free response data
means that the "c" parameter does not have to be estimated (Hulin,
Lissak, & Drasgow, 1981).

Males and females were chosen as the most appropriate samples to
use in this study (even though the most pervasive questions of bias
deal with testing blacks and whites) for several important reasons:

1. First, the study was not designed to examine CONTENT bias
against any particular group; it was designed to test which METHODS
are functioning properly.

2. Second, and most importantly, the study COULD NOT answer the
questions about the sufficiency of the methods if blacks and whites
were used. The cultural groups chosen were irrelevant except for two
important stipulations:

A. Groups chosen had to be UNCONTROVERSIALLY roughly equal
in ability; and

B. Both groups had to be able to agree on which items are
biased and unbiased. Anyone who has worked with blacks knows this
simply does not hold. For example, some blacks may feel that all items
reflect white middle class culture, perhaps justifiably so.
Furthermore, no one knows how much of the observed difference between blacks
and whites is due to ability and how much is due to bias.

Without agreement on which items are biased and unbiased, it
would be difficult to separate differences due to bias and differences
due to ability.

Choosing groups roughly equal in ability¹ and who could agree on
biased and unbiased items enables us to:

A. Circumvent problems associated with artifacts noted by Hunter
(1975) that are due to distributions of unequal ability;

B. Examine the contribution of ability differences (measured
by unbiased items) and differences due to biased items in producing
observed total score variations. Furthermore, we can observe how this
affects the detection of biased items;

C. Simulate observed black/white differences but in a situation
where we can tell what is due to an ability difference and what
difference is due to bias.

¹ It was thought that males/females at this university would be
roughly of equal ability. This assumption was checked by comparing
observed distributions of males and females on the 60 unbiased items,
and will be discussed later in the Results section.

Research instrument development: The research instrument developed
for use in this study was designed to measure general information. As
a starting point, the unbiased items were constructed parallel to those
on the general information section of the WAIS. For example, instead
of the question, "Who wrote Faust?" one question was "Who wrote
Catcher in the Rye?" For the items intended to be biased, samples of
males and females were asked to generate items with these directions:
Given a male and female of equal ability on general information, give
examples of items that a male would have a greater probability of
getting right.

A preliminary instrument was generated consisting of 150 items (57
biased and 93 unbiased). A sample of 32 males and 41 females was asked
to evaluate each of the items for bias by rating each item on a 5-point
scale (from unbiased = 1 to biased = 5) where bias was defined as above;
i.e., given equal ability, males have a greater probability of getting
the item right. A different sample (37 males, 37 females) was asked
to answer the 150 questions, so that information on the difficulty and
item to total correlations could be obtained.

The reliability of the 150-item test in the combined sample was 
.94 (Cronbach's Alpha). 

Items were defined as biased if more than 55% of the combined male
and female (N = 74) group gave it a 4 or 5 and there was no significant
difference between the male and female sample rating. (Low bias
included 55-75%, medium bias 75-85%, and high bias 85-95% rating it a 4 or 5.)
In addition, items were dropped if they had "p" values of greater than
.95 or less than .05, or point biserials less than .15 (combined sample).
From the items surviving the above, a final instrument comprised of
110 items (65 unbiased, 45 biased) was administered.
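
The statistical screening rule (drop items with "p" values above .95 or below .05, or with point biserials below .15) can be sketched in Python. The function names `point_biserial` and `keep_item` are hypothetical, and the sketch assumes the point biserial is the Pearson correlation between the 0/1 item score and the total score; whether the total should exclude the item itself is not stated in the text and is left to the caller. The total scores are assumed not to be all identical.

```python
from math import sqrt


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and total scores."""
    n = len(item)
    mean_item = sum(item) / n
    mean_total = sum(totals) / n
    cov = sum((i - mean_item) * (t - mean_total) for i, t in zip(item, totals))
    var_item = sum((i - mean_item) ** 2 for i in item)
    var_total = sum((t - mean_total) ** 2 for t in totals)
    return cov / sqrt(var_item * var_total)


def keep_item(item, totals, p_low=0.05, p_high=0.95, min_rpb=0.15):
    """Apply the difficulty and discrimination screens described above."""
    p = sum(item) / len(item)
    if p < p_low or p > p_high:
        return False  # too hard or too easy
    return point_biserial(item, totals) >= min_rpb


# Illustrative data for six examinees.
total_scores = [10, 9, 8, 3, 2, 1]
discriminating = keep_item([1, 1, 1, 0, 0, 0], total_scores)
too_easy = keep_item([1, 1, 1, 1, 1, 1], total_scores)
```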

Characteristics of data sets. Each of 533 males and 590 females
took the 110-item research instrument. Six different data sets were
constructed from the initial data base so that (1) the percentage of
biased items could be varied and (2) the observed differences in unbiased
score distributions could be set to one standard deviation apart. The
purpose of this manipulation was to mirror what is often found in
black/white scores on tests, but in a controlled situation where the
source of the difference (ability or bias) would be known.

1. Percentage of biased items. The data were analyzed with 0%
biased items (60 items; all unbiased), 23% biased items (78 items;
60 items unbiased, 18 items biased²), and 40% biased items (100 items;
60 unbiased, 40 biased). These three amounts of bias were chosen because
they approximate what has been found in empirical studies. For
example, Scheuneman (1975, 1977) found 14% to be biased against blacks;
Ironson and Subkoviak (1979) found 24% biased against blacks;
Scheuneman (1976) found 35% to be biased across several groups; and Rudner
(1977) found 56% biased against hearing-impaired subjects.

2. Observed differences in score distributions of males and
females. It was felt that males and females at this university would be
of roughly equal ability. Therefore, the first three of the six data
sets would be generated solely by changing the proportion of biased
items. Data sets four, five, and six would be generated by selecting
out high ability females and low ability males so that unbiased score
differences (on the unbiased items) would be set to one standard
deviation apart.

² Eighteen items were chosen for bias so that items would cover a
"p" value range (easy, medium, hard) and items would cover a bias
range (low, medium, high).

The assumption of equal ability males and females was checked by
examining the distribution of males and females on the 60 unbiased
items. The males were approximately one-half of a standard deviation
above the females (males: mean = 33.11, S = 10.22; females: mean =
28.75, S = 10.41). Because of this unexpected difference, another measure
of ability was obtained: grade point average. On this measure, females
were about one quarter of a standard deviation above males
(males: mean = 2.58, S = .67; females: mean = 2.77, S = .65). Since the
discrepancies were in opposite directions, their abilities were seen as
roughly equivalent.
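
The two standardized differences quoted above can be checked with a short calculation. The sketch below assumes the difference in means is divided by a pooled standard deviation formed from the simple average of the two variances; the text does not say which standard deviation was used, and the function name `standardized_difference` is hypothetical.

```python
from math import sqrt


def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Mean difference in pooled standard deviation units
    (simple average of the two variances; an assumption)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd


# Males minus females on the 60 unbiased items.
unbiased_items = standardized_difference(33.11, 10.22, 28.75, 10.41)
# Males minus females on grade point average.
grade_point = standardized_difference(2.58, 0.67, 2.77, 0.65)
```

Under this assumption the difference on the unbiased items is a little over .4 standard deviations in favor of males and the grade point difference is a little under .3 in favor of females, in line with the rounded "one-half" and "one quarter" given in the text.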

In order to create the unequal ability groups with an observed
one standard deviation difference between males and females, the
following procedure was used. The desired difference between males and
females was targeted at one standard deviation, or about 10 points
(with a standard deviation of about 10 points). This would mean moving
the male mean up about 2-3 points and moving the female mean down about
2-3 points, while maintaining the shapes of the respective
distributions. Knowing the desired mean and standard deviation of the female
distribution, we calculated the proportions of females at each score
level which would give this. Then females were randomly sampled
from each score level to achieve numbers that reflect those proportions.
We then repeated the same procedure for males. This resulted in a
reduced sample (N = 909; 433 males, 476 females) with a one standard
deviation difference (males: mean = 35.64, S = 9.14; females: mean =
26.20, S = 9.31) on the 60 unbiased items.
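
The selection procedure can be sketched as follows. This is one possible reading, not the authors' program: the function name `resample_to_target` is hypothetical, and the target proportions at each score level are taken from a normal density with the desired mean and standard deviation, which is an assumption because the text does not say how the proportions were computed.

```python
import random
from collections import defaultdict
from statistics import NormalDist, mean


def resample_to_target(scores, target_mean, target_sd, seed=0):
    """Return indices of examinees sampled so that the score distribution
    approximates a target mean and SD (normal shape assumed)."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for index, score in enumerate(scores):
        by_level[score].append(index)
    target = NormalDist(target_mean, target_sd)
    weights = {s: target.pdf(s) for s in by_level}
    # Largest scale factor such that no score level needs more
    # examinees than are actually available at that level.
    k = min(len(by_level[s]) / weights[s] for s in by_level if weights[s] > 0)
    chosen = []
    for s, members in by_level.items():
        n_keep = min(len(members), int(k * weights[s] + 1e-9))
        chosen.extend(rng.sample(members, n_keep))
    return sorted(chosen)


# Illustrative pool: ten examinees at every score from 0 to 60 (mean 30).
pool = [s for s in range(61) for _ in range(10)]
kept = resample_to_target(pool, target_mean=35, target_sd=9)
kept_scores = [pool[i] for i in kept]
```

In this toy pool the retained examinees have a mean of 35 rather than 30. With real data the achieved mean and standard deviation would only approximate the targets, since they depend on how many examinees are available at each score level.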

Six data sets used. Thus, the first three data sets were composed
of roughly equal ability males and females (using the original sample)
where the instrument was analyzed with 0%, 23%, or 40% biased items. Data
sets four, five, and six were composed of a reduced sample of males
and females manipulated to be one standard deviation apart where the
instrument was again analyzed with 0%, 23%, and 40% biased items.

Table 1 summarizes the characteristics of the six data sets used
in the study. They can be placed on a continuum. Set I is the small
difference in ability, no bias. Set VI is the large ability difference,
large bias amount. The various sets in between vary in relation to the
bias amount and the ability differences present.

Table 1. Characteristics of Data Sets Used in This Study.

Data   Abbre-    Size            M/F Diff. on        % Biased   Number of Biased
Set    viation                   60 Unbiased Items   Items      Items to Total Items

I      S60       533 M, 590 F    1/2 SD               0         (0/60)
II     S78       533 M, 590 F    1/2 SD              23         (18/78)
III    S100      533 M, 590 F    1/2 SD              40         (40/100)
IV     L60       433 M, 476 F    1 SD                 0         (0/60)
V      L78       433 M, 476 F    1 SD                23         (18/78)
VI     L100      433 M, 476 F    1 SD                40         (40/100)

M = males; F = females; SD = standard deviation.

RESULTS

For each of the six conditions (three proportions of biased items
with or without manipulation to achieve a one standard deviation
difference on the unbiased items), five item bias techniques were calculated
and are described below.

Bias Methods Used in the Analysis 

1. Transformed Item Difficulty (TID). The distance D_i from the major
axis of the ellipse was computed for each item and used as the
measure of bias. The sign indicating direction of bias was maintained.
A positive sign indicates an item that is relatively more difficult
for females.

2. One parameter item characteristic curve (1ICC). The difficulty
parameter for each item in each group was estimated by BICAL.
The t statistic as described previously was used. A positive t
represents an item biased against females.

3. Scheuneman chi-square (SCHI). Scheuneman's chi-square, which
considers only proportion correct for each ability level, was
obtained for each item. Each item was given a sign according to
the direction of the difference of p values within ability levels.
A positive sign indicates bias against females.

4. Full chi-square (FCHI). The full chi-square, which includes both
correct and incorrect proportions for each ability level, was
obtained for each item. As in Scheuneman's chi-square, each item
was given a sign according to the direction of the difference
of p values within ability levels. A positive sign indicates bias
against females.

5. Three parameter item characteristic curve area (AREA). The ICC
was estimated separately for each group by the LOGIST program. After
linear equating, the area between the ICC for females and males
was computed by the formula given in Rudner, Getson, and Knight
(1979). A positive sign attached indicates bias against females.
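
The AREA index in item 5 can be approximated numerically, as in the Python sketch below. It is an illustration under stated assumptions and does not reproduce the formula of Rudner, Getson, and Knight: the ability range of -3 to +3, the step of .005, the midpoint summation, the scaling constant D = 1.7, and the function names `icc` and `area_between` are all assumptions, and the item parameters are taken to be already equated to a common scale.

```python
from math import exp


def icc(theta, a, b, c=0.0, D=1.7):
    """Three parameter logistic ICC (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def area_between(male_params, female_params, lo=-3.0, hi=3.0, step=0.005):
    """Signed and unsigned area between two equated ICCs.

    Each params tuple is (a, b, c). A positive signed area means the
    male curve lies above the female curve, i.e. bias against females.
    """
    n_steps = int(round((hi - lo) / step))
    signed = 0.0
    unsigned = 0.0
    for k in range(n_steps):
        theta = lo + (k + 0.5) * step  # midpoint of the k-th slice
        diff = icc(theta, *male_params) - icc(theta, *female_params)
        signed += diff * step
        unsigned += abs(diff) * step
    return signed, unsigned


# Illustrative item: equally discriminating in both groups but half a
# unit more difficult for females.
signed_area, unsigned_area = area_between((1.0, 0.0, 0.0), (1.0, 0.5, 0.0))
```

When the two curves differ only in difficulty and c is zero, the area over the whole ability scale equals the difference in difficulty, so the value found here is just under .5.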

The means and standard deviations of each of the bias methods are
given in Table 2. The most striking result of that table is a general
trend for the bias indices, and their variability, to be markedly less in
the conditions (I and IV) with no biased items. (Caution must be
exercised in interpreting the chi-square indices here because only items
that could be divided into five ability intervals are included in this
table. As is noted at the bottom of the table, this resulted in a
considerable loss of items, particularly when there was a high percentage
of biased items and a large ability difference. In interpreting the
table, it should also be kept in mind that the signed TID always
sums to zero.) Of additional interest is a finding of close similarity
in means and standard deviations across small and large ability
differences (contrasting I and IV, II and V, III and VI). Thus, the
major contributor to differences appears to be whether there are biased
items or not.

Table 2. Means and Standard Deviations of Bias Methods.

Entries are means, with standard deviations in parentheses.

Data        TID             1ICC               SCHI               SCHI
Set         Signed          Signed             Unsigned           Signed

I-S60       .000 (.626)     -.036 (2.935)      6.48 (6.58)        +.047 (9.097)
II-S78      .000 (1.10)     .262 (4.73)        14.322 (18.688)    +3.508 (23.257)
III-S100    .000 (1.16)     .279 [illegible]   15.676 (20.168)    +2.378 (25.287)
IV-L60      .000 (.670)     -.006 (2.75)       5.586 (5.482)      +1.975 (7.448)
V-L78       .000 (1.07)     .380 (3.84)        9.328 (12.156)     5.206 (14.308)
VI-L100     .000 (1.08)     .280 (4.197)       6.41 (5.249)       +.460 (8.01)

Data        FCHI                  FCHI               AREA           AREA
Set         Unsigned              Signed             Signed         Unsigned

I-S60       12.907 [illegible]    .605 (17.189)      .064 (.376)    .379 (.218)
II-S78      27.656 (33.167)       6.602 (42.661)     .113 (.678)    .624 (.387)
III-S100    30.602 (34.124)       4.502 (45.44)      .136 (.693)    .609 (.411)
IV-L60      11.500 (10.34)        4.241 (14.583)     .046 (.386)    .382 (.205)
V-L78       18.082 (22.318)       +10.115 (26.73)    .292 (.702)    .597 (.492)
VI-L100     13.309 (10.705)       2.49 (16.57)       .203 (.691)    .583 (.458)

CHI had these sample sizes (number of items that could not be
evaluated in five intervals is in parentheses):

I - 54 (6); II - 64 (14); III - 79 (21); IV - 52 (8); V - 42 (26);
VI - 51 (49).

Table 3. Intercorrelations Among Signed Bias Indices.

           TID    1ICC   *SCHI  *FCHI         TID    1ICC   *SCHI  *FCHI

           (I. S60)                           (IV. L60)
1ICC       .99                                .99
*SCHI      .89    .90                         .82    .83
*FCHI      .93    .94    .96                  .87    .87    .96
AREA       [illegible]                        .82    .82    .91    .94

           (II. S78)                          (V. L78)
1ICC       [illegible]                        .94
*SCHI      [illegible]                        .87    .86
*FCHI      .93    .92    .98                  .89    .88    .98
AREA       .92    .94    .92    .95           .87    .91    .91    .93

           (III. S100)                        (VI. L100)
1ICC       .99                                .99
*SCHI      .85    .84                         .38    .40
*FCHI      .90    .89    .98                  .41    .43    .97
AREA       .92    .92    .90    .93           .82    .82    .44    .47

* Sample sizes for the CHI procedures are:

I - 54/60; II - 64/78; III - 79/100; IV - 52/60; V - 42/78; VI - 51/100.

This is because items that could not be evaluated in five intervals
were dropped.

Agreement Among Bias Methods

Table 3 gives the intercorrelations among the bias methods for the
six conditions of the study. For the first three conditions (small
ability difference), the intercorrelations are for the most part in the
.90s. The percent of biased items does not appear to make a difference.
The high agreement among methods is also apparent for conditions IV
and V. The lower agreement for the chi-square techniques in condition
VI (large ability difference) may be due to the large number of items
which could not be evaluated using five intervals.

There unfortunately is no easy way of putting items that have been
evaluated with a different number of intervals back on the same scale.
The unsigned significance or "p" value could be used. This would change
the correlations between SCHI and TID, 1ICC, and AREA to .25, .26, and
.21, and between FCHI and TID, 1ICC, and AREA to .27, .27, and .22.
These are based on an N of 96. (Four items could not be evaluated even
with only two ability levels.) That the correlations are lower using
"p" values is not surprising. In the first place, the correlation
between the chi-square value and the "p" value of the significance
test is only roughly .6 to .8. Secondly, an earlier study (Rudner,
1977) also found that "p" values did not function well.

In general, then, Table 3 shows excellent agreement among the
methods except for the sixth condition with the chi-square techniques.
The agreement does not appear to be affected by either the percent of
biased items or the ability difference, except insofar as a large
ability difference affects the computation of the chi-square.
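
The intercorrelations discussed above are ordinary product-moment
correlations computed over items, one signed index value per item per
method. The following is a minimal sketch of that computation, not the
authors' program. The function names and the toy index values are
hypothetical, and it assumes every method has a value for every item
(items dropped from a chi-square run would first have to be removed
from the other lists as well).

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercorrelations(indices):
    """Lower-triangular correlations among methods.

    `indices` maps a method name to its list of signed bias values,
    with items in the same order for every method.
    """
    names = list(indices)
    return {(a, b): pearson(indices[a], indices[b])
            for i, a in enumerate(names) for b in names[:i]}


# Made-up signed bias indices for five items (illustration only).
toy = {
    "TID":  [0.2, -0.5, 1.8, 0.1, 2.4],
    "1ICC": [0.3, -0.4, 2.0, 0.0, 2.9],
    "AREA": [0.1, -0.2, 0.9, 0.2, 1.0],
}
table = intercorrelations(toy)
for (row, col), r in table.items():
    print(f"{row:5s} {col:5s} {r:.2f}")
```

Each key is a (row, column) pair in the same lower-triangular layout
as Table 3.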

Agreement Between Subjective Methods of Bias
and Statistical Bias Methods

The correlations of bias indices with judged bias are presented in
Table 4. For the conditions of small ability differences (I, II, and
III), the correlation between the indices and judged bias is moderate
(.7-.8) when there are biased items, and low (.3-.4) when there are
not. For the large ability difference conditions IV and V, the same
pattern repeats: high correlations when there are biased items (V),
low correlations where there are no biased items (IV). In examining
condition VI, which has a preponderance of biased items (40%), TID
and 1ICC have the highest correlation with judged bias. The lower
correlations of the chi-square techniques may be due to the loss of
49 items that could not be evaluated with 5 intervals. Using unsigned
"p" values instead did not alter the correlations appreciably (.29,
.35). The reason for the lower AREA correlation was not immediately
apparent. One possibility that was explored and rejected was that
lack of unidimensionality in condition VI may have harmed the AREA
measure (see the section on factor analysis results). Both the percent
of variance and ratio of first to second eigenvalues were extremely
stable across the six conditions.
Table 4. Correlation of Signed Bias Indices with Judged Bias.

          I      II     III    IV     V      VI
          S60    S78    S100   L60    L78    L100

TID       .38    .84    .87    .31    .78    .83
1ICC      .37    .83    .86    .32    .75    .82
*SCHI     .47    .78    .66    .46    .80    .24
*FCHI     .45    .80    .72    .44    .82    .27
AREA      .42    .84    .77    .43    .83    .67

* CHI sample sizes are reduced; see Table 3.
Agreement Between Statistical Methods and
Biased/Unbiased Classification

Table 5 applies cutoff values found in the literature to the
identification of the biased items. TID is biased if it is greater
than 1.5 (Strassberg-Rosenberg & Donlon, 1975), 1ICC is biased if it
is greater than 2.4 (Draba, 1979), CHI is biased if it is significant
at the .05 level, and AREA is biased if it is greater than .70 (Merz &
Grossen, 1978).

Using these cutoffs, several trends are apparent in Table 5. The
first is that the cutoffs work best with a smaller proportion of
biased items (18 vs. 40). The cutoffs also work better with the small
ability difference data sets (II and III). For the large ability group
differences, the 1ICC and FCHI seem to work best. Finally, the TID
cutoff appears to be too stringent, because a lot of biased items are
missed.

Two additional points are essential to note in interpreting
this table. The first is that the cutoffs were applied in only one
direction. This means that, if the cutoff for TID was +1.5, then an
item with a TID of -1.6 was not considered biased. Similarly, when the
"p" values were calculated for SCHI and FCHI, a sign was attached
indicating the direction. Again, if the sign was in the opposite
direction, the item was not declared biased even if the "p" value
was small.
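
The one-directional rule described above can be sketched as follows.
This is an illustration under stated assumptions rather than the
procedure actually programmed for the study: the function names are
hypothetical, a positive sign is assumed to mark the direction of bias
under study, and the toy values are invented. Only the cutoffs
themselves (1.5, 2.4, .70, and the .05 level) come from the text.

```python
# Cutoffs quoted in the text; ALPHA is the .05 level used for SCHI and FCHI.
CUTOFFS = {"TID": 1.5, "1ICC": 2.4, "AREA": 0.70}
ALPHA = 0.05


def flag_signed_index(method, value):
    """Flag an item only when its signed index exceeds the cutoff.

    A large negative value (e.g. TID = -1.6) is deliberately not flagged.
    """
    return value > CUTOFFS[method]


def flag_chi_square(p_value, direction):
    """Flag a chi-square result only when it is significant AND its
    attached sign (direction > 0, an assumed convention) points the
    same way as the bias under study."""
    return direction > 0 and p_value < ALPHA


def count_hits_and_false_alarms(flags, judged_biased):
    """Compare statistical flags with the a priori judgments."""
    hits = sum(f and j for f, j in zip(flags, judged_biased))
    false_alarms = sum(f and not j for f, j in zip(flags, judged_biased))
    return hits, false_alarms


# Invented TID values for four items; the first two were judged biased.
tid_values = [1.6, -1.6, 0.4, 2.1]
judged = [True, True, False, False]
flags = [flag_signed_index("TID", v) for v in tid_values]
hits, false_alarms = count_hits_and_false_alarms(flags, judged)
print(hits, false_alarms)
```

Replacing `value > cutoff` with `abs(value) > cutoff` would give the
two-directional rule, under which more items are flagged.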
Table 5. Identification of Biased Items Using Selected
Cutoffs for Statistical Procedures.

Data Set         Method

                 TID      1ICC     SCHI        FCHI        AREA
                 (>1.5)   (>2.4)   (p < .05)   (p < .05)   (> .70)

II-18 biased     11       18       13          18          16
V-18 biased       9       17       16          17          16
III-40 biased    26       33       26          36          17
VI-40 biased      7       29       10          25          16
A second additional point regards the number of items incorrectly
identified as biased (by statistical procedures) when they
are unbiased (by subjective judgment). For most of the conditions,
the number was less than 5 with one notable exception: For the 60
unbiased items (I and IV) the 1ICC incorrectly identified 12 and 13
items, respectively, as biased. The averages over all 6 conditions
out of a possible total of 60 unbiased items were: TID-0, 1ICC-5.2,
SCHI-2.3, FCHI-6.5, AREA-2. While these numbers may seem fairly
low, if one disregards sign, the numbers increase dramatically (see
paragraph above). That is, many items were identified as biased
against males when they were intended to be unbiased items.

Psychometric Properties of the Tests

Tables 6 through 9 describe the effects of varying the ability
differences and proportion of biased items on the psychometric
properties of the test. Descriptive information is provided in Table 6.
Tables 7 through 9 provide information on reliability,
unidimensionality, and validity, respectively.

Table 6 shows the effect of ability differences and bias amount
on the observed means and standard deviations. The difference in
ability magnifies the difference between males and females 1.65 times.
More interesting, however, is the proportion of biased items:
increasing it to 23% magnifies the difference 1.6 times; increasing
it to 40% magnifies the difference 2 times. Thus, both ability
Table 6. Observed Means and Standard Deviations for Six Conditions.

Small Ability Difference
               Males (533)      Females (590)

I.   S60       33.11(10.22)     28.75(10.41)
II.  S78       45.11(11.91)     39.36(12.54)
III. S100      61.74(14.91)     45.45(16.15)

Large Ability Difference
               Males (433)      Females (476)

IV.  L60       35.64(9.14)      26.20(9.31)
V.   L78       48.03(10.42)     32.42(11.31)
VI.  L100      65.29(12.94)     41.96(14.78)
Table 7. Reliabilities of Tests Composed of Varying Amounts of Bias.

                            Males      Females

(Small Ability Difference Groups)

I.   S60  (0% biased)       .8981      .9007
II.  S78  (20% biased)      .9036      .9123
III. S100 (40% biased)      .9232      .9315

(Large Ability Difference Groups)

I.   L60  (0% biased)       .8726      .8761
II.  L78  (20% biased)      .8746      .8927
III. L100 (40% biased)      .8993      .9188
Table 8. Factor Analysis Results (Males & Females Combined).

Condition       % Variance Explained     Ratio of First to
                by First Factor          Second Eigenvalue

N = 1123:
I.   S60        15.8                     4.18
II.  S78        14.6                     3.26
III. S100       15.7                     2.86

N = 909:
IV.  L60        15.4                     3.94
V.   L78        15.1                     3.83
VI.  L100       17.0                     3.56
Table 9. Validities of Tests Composed of Varying Amounts of
Bias (correlation with GPA).

                                Male          Female

Unbiased items                  [illegible]   .32
Biased items                    .04           .13

Small Ability Differences:
  I.   0% biased                .21           .32
  II.  20% biased               .18           .29
  III. 40% biased               .15           .25

Large Ability Differences:      (N=354)       (N=406)
  IV.  0% biased                .16           .35
  V.   20% biased               .13           .32
  VI.  40% biased               .10           .27

Note 1: Item-total correlations on M, F combined:

        Mean = .23 (s.d. = .12) for unbiased items
        Mean = .41 (s.d. = .11) for biased items.

Note 2: On the combined male and female sample the validity of the
        unbiased items was .23; the validity of the biased items
        was -.02.
differences and the proportion of biased items had a pronounced
effect on observed differences.

Table 7 presents the reliabilities for the six conditions. The
reliabilities are all high and do not seem to be much affected by
either ability difference, sex, or proportion of biased items.
Reliabilities are slightly higher for longer tests, which is what one
would expect.
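
One way to see why longer tests are expected to be more reliable is
the Spearman-Brown prophecy formula. The formula is brought in here
purely as an illustration; it is not part of the study's analysis, and
it assumes the added items are parallel to the existing ones, which
biased items may well not be. The sketch below projects the 60-item
reliability for males in the small ability difference group (.8981,
Table 7) to 78 and 100 items. The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    (new length / old length), assuming the added items are parallel."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


base = 0.8981  # Table 7: males, S60, 0% biased
for n_items in (78, 100):
    projected = spearman_brown(base, n_items / 60)
    print(n_items, round(projected, 4))
```

The projections can be set beside the observed values of .9036 and
.9232 in Table 7 as a rough benchmark only.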

Table 8 presents the principal components factor analysis results
in order to investigate unidimensionality. The test appears to be
marginally unidimensional as indicated by high reliability, and
ratios of first to second eigenvalue of about 3. The percent of
variance explained by the first factor was remarkably stable across
all six conditions of the study. This was particularly surprising
because one would expect the conditions with biased items on them to
be multidimensional. In fact, one hypothesis for the poorer
performance of the AREA measure in condition six was that it lacked the
unidimensionality required for a latent trait analysis. However, it
was no more nor less unidimensional than the other conditions.
Furthermore, Reckase (1979) recommends 15-20% variance explained by the
first factor for a latent trait analysis.
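
The two summary statistics reported in Table 8 can be derived from the
eigenvalues of a principal components solution. The sketch below is a
hedged illustration: the function name is hypothetical, the
eigenvalues are invented, and it assumes the eigenvalues of the full
item correlation matrix are already available (so that their sum
equals the number of items).

```python
def unidimensionality_summary(eigenvalues):
    """Return percent of variance explained by the first component and
    the ratio of the first to the second eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    if len(ordered) < 2 or ordered[1] <= 0:
        raise ValueError("need at least two positive eigenvalues")
    total = sum(ordered)
    return {
        "pct_first": 100.0 * ordered[0] / total,
        "ratio_first_to_second": ordered[0] / ordered[1],
    }


# Invented eigenvalues for a four-item example (they sum to 4).
summary = unidimensionality_summary([1.0, 2.0, 0.6, 0.4])
print(summary)
```

A first-component percentage in the 15-20% range would correspond to
the Reckase (1979) recommendation cited above.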

Table 9 presents the validity data (correlation of research
instrument with GPA). Several findings are evident from the table.
First, this test appeared to be more valid for females than for males.
Secondly, for both men and women, the unbiased items were more valid
than the biased items. To check whether this result may
have been due to the unbiased items simply being better items, the
item-total correlations were examined. In fact, the biased items had
higher correlations. Third, as the proportion of biased items went
up, the test validity went down for both males and females (but not
by a large amount). To hope that item bias studies will eliminate
test bias is being overly optimistic.
CONCLUSION

The purpose of this study was to determine the effect that amount
of bias and amount of ability difference would have on (1) agreement
among the statistical methods, (2) agreement between the statistical
methods and the subjective judgments of bias, and (3) the psychometric
properties of the tests such as reliability and validity.

The following is a summary of the major results:

1. The agreement among the methods is very high except for data set VI
   (large bias and large ability difference) for the chi-square
   techniques. This is very likely an artifact of loss of items due to
   an inability to use five intervals in calculating the chi-square.
   A procedure for getting chi-squares back on one scale needs to be
   developed. Otherwise the agreement was high, especially compared
   to other studies in the published literature (Burrill, 1982).

2. The correlation between statistical indices and judged bias was
   moderate when there are no biased items (which one would expect due
   to restriction of range), very high when there are biased items
   (II, III, and V) and not as good when there are both large ability
   differences and the amount of bias is large (VI). In the latter
   case, TID and 1ICC performed best.

3. Cutoffs for all procedures except TID appear to be working
   moderately well when only 23% of the items are biased and the
   direction of bias is taken into account. The TID procedure
   underidentifies biased items. All methods would overidentify items
   as biased if the sign were not taken into account (due to the
   interaction nature of the methods). Studies measuring bias in both
   directions most probably overestimate the amount of bias with these
   cutoffs.

   When the amount of bias goes up to 40%, all of the procedures
   miss a fair number of items, especially when there are large
   ability differences. Despite this, the 1ICC and FCHI do the best
   job of identifying these items.

4. The presence of biased items increases the mean score differences
   between males and females, does not seem to affect reliability,
   and does decrease validity (although not by a large amount).

The major overall result is that the methods seem to be working
well except for the last condition. It should be noted, however, that
the last condition is rather extreme. Although there is a one
standard deviation difference on the unbiased items, there is
approximately a two standard deviation difference in total score.

This research points out several areas for further work. Effort
needs to be directed toward improving cutoffs for the procedures.
Sampling distributions need to be developed. A method needs to be
developed for putting chi-squares calculated on different numbers of
intervals back on the same scale.
It was felt that the procedures attempting to control for
ability would work better than those not controlling for ability
when the ability difference was varied. This turned out not to be
the case. The dimensionality hypothesis was rejected as a cause for
this. It may be, however, that the area and chi-square procedures
would work better with larger sample sizes. Expected frequencies
within each interval would then be adequate.

Finally, the practitioner may be comforted in knowing that, if
the ability differences are 1/2 a standard deviation or less on the
unbiased items or 23% or less of the items are biased, all of the
methods agree fairly well.